1. Description of Program


This report summarizes accomplishments of the ISIS project during the period Feb 4, 1986 -
May 4, 1986. We assume that the reader is familiar with the goals of the project and has read
some of our recent progress reports. Accordingly, the summary will be brief and targeted to
specific accomplishments made during this period, rather than the overall status of the project.

The first quarter of 1986 represents the beginning of our second year of DARPA funding,
and we are pleased to report substantial progress in several important areas. Our effort is now
focusing on making the technology of fault-tolerance easier to use and stripping as much
unessential overhead from our approach as we can. We believe that by adopting what is essentially a
"RISC" approach to software fault-tolerance, it will be possible to address a broader collection of
distributed computing problems than we have in the past, making our work useful to practitioners
whose applications cannot be addressed efficiently using our current approach (resilient objects).
Our plan is to develop a new system that will continue to provide resilient objects at a high level,
but will also include support for fault-tolerant process groups (described below) at a lower level.
This lower level will be directly accessible to programmers, and much of our own software will
reside within it, including a collection of fault-tolerant services embodying specialized distributed
algorithms, such as the shared memory mechanism described below.

In the subsections that follow, we first summarize activity on the ISIS prototype, then discuss
the new system, and then describe some of the other activities of the project.

1.1. ISIS Prototype and Application Software


Since completing the ISIS prototype, we have been using it to develop application software.
This presently includes a calendar program, described in our previous quarterly status report, and
a distributed monitoring program that uses ISIS to distribute a task over multiple sites and then to
monitor the computation while it is underway, reacting dynamically to failures and other events.
It is interesting to note that both of these programs were developed largely by naive programmers
with no understanding of fault-tolerance or distributed protocols. In addition to demonstrating
how ISIS can be used, these applications have helped us debug it. Because the prototype now
seems quite stable, a copy has been made available to colleagues at Berkeley, where Prof. Domenico
Ferrari's group is considering using it for experimental purposes. We will continue to make
this version of ISIS available to other researchers on a limited basis.

A new paper on ISIS was completed during the report period [1]. In addition to giving a
detailed analysis of the algorithms used in the system, this paper describes the calendar program
and the techniques used to develop it, and presents new performance data. One unexpected
insight resulted from this performance work. We discovered that using our concurrent update
techniques [2] [3], updates of replicated data can actually be cheaper than updates to
non-replicated data. This is counterintuitive: one would have expected that a computer system must do
more work to maintain multiple copies of a data item than to maintain just a single copy, and
hence performance should degrade as the number of copies increases. In fact, ISIS operates more
efficiently under moderate distributed loads, for two reasons. First, our approach divides updates
to replicated objects into local and remote computational activity, and the remote part is much
cheaper than the local part. Thus, when the local work for a collection of operations is
distributed over multiple sites, a given site ends up doing less work than if it had to do everything by
itself. Moreover, a significant amount of “piggybacking” occurs when the system operates this
way (that is, a typical communications packet carries multiple messages, not just one message).
Since I/O overhead is a significant cost factor in ISIS, this means that it becomes less expensive to
read a typical message, hence efficiency rises until the maximum level of piggybacking is reached.
The net impact of these two effects is that performance improves when objects are replicated to
small numbers of sites, for moderate request loads presented randomly at all sites. Specifically,
we observed improved performance for objects distributed to as many as six sites and subjected to
request loads of up to 10 operations per second, which is very respectable for SUN 2 workstations
running UNIX. Improvements in both the average response time and the maximum number of
operations the object could perform per second were noted. These results support our belief that
what we are doing in ISIS will be valuable in a wide range of distributed computing projects in

1.2. Fault-tolerant process groups


The crux of our present effort is to develop a system that will support fault-tolerant process
groups, an idea which we first reported in [4] and elaborate on in [5] (a copy of which is attached).
Such a group consists of a set of processes that cooperate to implement some fault-tolerant
distributed service. In the case of a resilient object, the group members are the components of the
object, and implement a coordinator-cohort algorithm to provide fault-tolerant processing of client
requests [1]. However, the process group approach also simplifies a wide range of other
problems, ranging from distributed process control software (e.g. to control critical tasks within a
power plant or spacecraft) to more conventional distributed computing tasks, such as dynamic
reconfiguration of a distributed program. We are making rapid progress on an implementation of
this approach, and will soon complete the lowest levels of a new system providing
fault-tolerant inter-site and inter-process communication support based on the algorithms given in our
papers. Once this layer has been completed, we expect to have the higher levels, which implement
the process group abstraction, working very rapidly. Our initial work is being done using UNIX,
but we are minimizing our dependence on UNIX-specific features in the expectation that UNIX
will eventually be replaced by some new operating system.

1.3. New forms of fault-tolerant objects

A forthcoming paper will report some recent work of ours on mechanisms for supporting
fault-tolerant objects that provide predictable behavior in the presence of concurrency and failures,
but without the overhead of a transactional access mechanism. Such an object is best viewed as a
“shared memory”, implemented using message passing in a way that provides predictable behavior
and eliminates both the need for higher level synchronization and for special code to handle
failures. Since these are common sources of complexity in distributed software, users of these
facilities can build fault-tolerant distributed programs without being particularly sophisticated
about distributed computing. On the other hand, since access to the shared memory is not
transactional, much higher performance can be achieved than using resilient objects. Moreover,
shared memories can be used for interprocess communication in ways that are awkward to express
using resilient objects. Thus, the approach promises to provide a cheap, easily used alternative to
resilient objects. We plan to include software support for this approach as a component of the
fault-tolerant process group system we are now building.

1.4.  Other  areas  of  activity 

Research on techniques for applying our work to parallel software and methods for
tolerating network partitioning continues. In addition, some of the new graduate students who have
joined the project are starting to explore problems in fault-tolerant process control and high level
operating systems software based on the fault-tolerant process group approach. We will have
more to say about work in all these areas during the next few months.

2.  Project  Personnel 

The ISIS project has been successful in attracting some very strong new graduate students,
largely because of a distributed computing course that Birman taught during Spring 1986. In fact,
the department as a whole has become an extremely active research area, and now includes six
faculty members with interests in areas relating to ISIS. There have been no changes in the key
personnel of the project, which continues to be run by Prof. Birman with the help of Dr. T.
Joseph.

3.  Travel 

Several members of the ISIS project attended the Asilomar workshop on fault-tolerant
distributed computing in Asilomar, California during March 1986. Prof. Birman presented a paper at
this workshop. Afterwards, he visited the IBM San Jose Research Center, Cheriton’s V group at
Stanford University, and Ferrari’s DASH group at Berkeley. Other trips (to the Universities of
Rochester and Toronto) did not use project funds. Additionally, graduate student A. El Abbadi
presented a paper on techniques for tolerating network partitioning at the March Principles of
Database Systems conference in Boston.

4. Budget summary

We conclude with a summary of the financial status of the project, which is close to
projections in all categories.


Expenditures, 2/5/86 - 5/4/86

                          Planned budget        Expenses  Prior expenses           Total
                              for period      for period      to 11/4/85        expenses

Secretary support                    544             544             484           1,028
Summer faculty                                       -0-          20,609          20,609
Research Associate                 8,700           8,700           6,444          15,144
Graduate students                 18,633          18,633          77,406          96,039
Employee benefits                  2,635           2,635           4,035           6,670
Computer maintenance                 769           1,000           4,025           5,025
Publications                         327             917           1,457           2,374
Supplies                             253             419           3,134           3,553
Computer Supplies                    154             -0-             643             643
Travel                             2,000           1,933          11,170          13,103
Programmer                                                         1,006           1,006
Equipment                                                         67,993          67,993
Indirect cost                     10,981          11,468          62,578          74,046

Totals                            44,996          46,249         260,984         307,233

ABSTRACT


We describe a collection of communication primitives integrated with a mechanism for
handling process failure and recovery. These primitives facilitate the implementation of fault-tolerant
process groups, which can be used to provide distributed services in an environment subject to
non-malicious crash failures.


1. Introduction

At Cornell, we recently completed a prototype of the ISIS system, which transforms abstract
type specifications into fault-tolerant distributed implementations, while insulating users from the
mechanisms by which fault-tolerance is achieved [Birman-a]. A wide range of reliable
communication primitives have been proposed in the literature, and we became convinced that by using such
primitives when building the ISIS system, complexity could be avoided. Unfortunately, the
existing protocols, which range from reliable and atomic broadcast [Chang] [Cristian] [Schneider] to
Byzantine agreement [Strong], either do not satisfy the ordering constraints required for many
fault-tolerant applications or satisfy a stronger constraint than necessary at too high a cost. In
particular, these protocols have not attempted to minimize the latency (delay) incurred before
message delivery can occur. In ISIS, latency appears to be a major factor that limits performance.
Fault-tolerant distributed systems also need a way to detect failures and recoveries consistently,
and we found that this could be integrated into the communication layer in a manner that reduces
the synchronization burden on higher level algorithms. These observations motivated the
development of a new collection of primitives, which we present below.

*This work was supported by the Defense Advanced Research Projects Agency (DoD) under ARPA order 5378,
Contract MDA903-85-C-0124, and by the National Science Foundation under grant DCR-8412582. The views, opinions
and findings contained in this report are those of the authors and should not be construed as an official Department of
Defense position, policy, or decision.


Our broadcast primitives are designed to respect several sorts of ordering constraints, and
have cost and latency that vary depending on the nature of the constraint required [Birman-b]
[Joseph-a] [Joseph-b]. Failure and recovery are integrated into the communication subsystem by
treating these events as a special sort of broadcast issued on behalf of a process that has failed or
recovered. The primitives are presented in the context of fault-tolerant process groups: groups of
processes that cooperate to implement some distributed algorithm or service, and which need to
see consistent orderings of system events in order to achieve mutually consistent behavior. Our
primitives provide flexible, inexpensive support for process groups of this sort. By using these
primitives, the ISIS system achieved both high levels of concurrency and surprisingly good
performance. Equally important, its structure was made surprisingly simple, making it feasible to reason
about the correctness of our algorithms.

In the remainder of this paper we summarize the issues and alternatives that the designer of a
distributed system is presented with, focusing on two styles of support for fault-tolerant
computing: remote procedure calls coupled with a transactional execution facility, and the fault-tolerant
process group mechanism mentioned above. Next, our primitives are described. We conclude by
speculating on future directions in which this work might be taken.

2.  Goals  and  assumptions 

The difficulty of constructing fault-tolerant distributed software can be traced to a number of
interrelated issues. The list that follows is not exhaustive, but attempts to touch on the principal
considerations that must be addressed in any such system:

1. Synchronization. Distributed systems offer the potential for large amounts of concurrency,
and it is usually desirable to operate at as high a level of concurrency as possible. However,
when we move from a sequential execution environment to a concurrent one, it becomes
necessary to synchronize actions that may conflict in their access to shared data or entail
communication with overlapping sets of processes. Additional problems that can arise in this
context include deadlock avoidance or detection, livelock avoidance, etc.


2. Fault detection. It is usually necessary for a fault-tolerant application to have a consistent
picture of which components fail, and in what order. Timeout, the most common mechanism
for detecting failure, is unsatisfactory, because there are many situations in which a healthy
component can time out with respect to one component without this being detected by
another. Failure detection under more rigorous requirements requires an agreement
protocol that is related to Byzantine agreement [Strong] [Hadzilacos].

3. Consistency. When a group of processes cooperate in a distributed system, it is necessary to
ensure that the operational processes have consistent views of the state of the group as a
whole. For example, if process p believes that some property P holds, and on the basis of
this interacts with process q, the state of q should not contradict the fact that p believes P to
be true. This problem is closely related to notions of knowledge and consistency in
distributed systems [Halpern] [Lamport]. In our context, P will often be the assertion that a
broadcast has been received by q, or that q saw some sequence of events occur in the same
order as did p.

4. Serializability. Many distributed systems are partitioned into data manager processes, which
implement shared variables, and transaction manager processes, which issue series of
requests to data managers [Bernstein]. If transaction managers can execute concurrently, it
is often desirable to ensure that transactions produce serializable outcomes [Eswaren]
[Papadimitriou]. Serializability is increasingly viewed as an important property in
“object-oriented” distributed systems that package services as abstract objects with which clients
communicate by remote procedure calls (RPC). On the other hand, there are systems for
which serializability is either too strong a constraint, or simply inappropriate.

Jointly, these problems render the design of fault-tolerant distributed software daunting.
The correctness of any proposed design and of its implementation become serious, if not
insurmountable, concerns. We faced this range of problems in our work on the ISIS system, and
rapidly became convinced that in the absence of some systematic [...] never be constructed.
In Sec. 6, we will show how the primitives of Sec. 5 provide such an [...]

The failure model that one adopts has considerable impact on the structure of the resulting
system. We adopted the model of fail-stop processors [Schneider]: when failures occur, a
processor simply stops (crashes), as do all the processes executing on it. We rejected the extremely
pessimistic assumptions of the malicious Byzantine failure models because they lead to slower, more
redundant software, and because the probability that a system failure will be undetectably
malicious seems vanishingly small in practice. Work based on Byzantine assumptions is described in
[Lamport] and [Schlicting]. We also assume that the communication network is reliable but
subject to unbounded delay. Although network partitioning is an important problem, we do not
address it here.

Further assumptions are sometimes made about the availability of synchronized realtime
clocks. Here, we adopt the position that although reasonably accurate elapsed-time clocks are
normally available, closely synchronized clocks frequently are not. For example, the 60Hz “line”
clocks commonly used on current workstations are only accurate to 16ms. On the other hand,
4-8ms inter-site message transit times are common and 1-2ms are reported increasingly often. Thus,
it is impossible to synchronize clocks to better than 32-48ms, enough time for a pair of sites to
exchange between 4 and 50 messages. We therefore assume that clock skew is “large” compared to
inter-site message latency.
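
The "between 4 and 50 messages" estimate can be checked with a short calculation. The sketch below uses only the figures quoted above; the function name, and the choice to pair the smallest skew with the slowest transit time and the largest skew with the fastest, are illustrative assumptions rather than part of the original analysis.

```python
# Figures quoted in the text, in milliseconds.
LINE_CLOCK_TICK_MS = 1000 / 60   # one tick of a 60Hz line clock, about 16.7ms
CLOCK_SKEW_MS = (32, 48)         # best clock synchronization the text considers achievable
TRANSIT_SLOW_MS = 8              # upper end of the "common" 4-8ms transit range
TRANSIT_FAST_MS = 1              # lower end of the 1-2ms range reported increasingly often


def messages_within(skew_ms: int, transit_ms: int) -> int:
    """Number of back-to-back one-way messages that fit inside a skew window."""
    return skew_ms // transit_ms


# Fewest messages: smallest skew window, slowest network.
fewest = messages_within(CLOCK_SKEW_MS[0], TRANSIT_SLOW_MS)
# Most messages: largest skew window, fastest network.
most = messages_within(CLOCK_SKEW_MS[1], TRANSIT_FAST_MS)

print(f"line clock tick: {LINE_CLOCK_TICK_MS:.1f} ms")
print(f"messages exchanged within the skew window: {fewest} to {most}")
```

Under these pairings the lower bound is 4 and the upper bound is 48, which is consistent with the rounded range given in the text.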

3.  Alternatives 

Two different approaches to reliable distributed computing have become predominant. The
first approach involves the provision of a communication primitive, such as atomic broadcast,
which can be used as the framework on which higher level algorithms are designed. Such a
primitive seeks to deliver messages reliably to some set of destinations, despite the possibility that
failures might occur during the execution of the protocol. We term this the process group
approach, since it lends itself to the organization of cooperating processes into groups, as
described in the introduction. Process groups are an extremely flexible abstraction, and have been
employed in the V Kernel [Cheriton] as well as in the ISIS system. The idea of using process groups
to address the problems raised in the previous section seems to be new.

A higher level approach is to provide mechanisms for transactional interactions between
processes that communicate using remote procedure calls [Birrell]. This has led to work on
nested transactions (due to nested RPCs) [Moss], support for transactions at the language level
[Liskov], transactions within an operating systems kernel [Spector] [Allchin] [Popek] [Lazowska],
and transactional access to higher-level replicated services, such as resilient objects in ISIS or
relations in database systems. The primitives in a transactional system provide mechanisms for
distributing the request that initiates the transaction, accessing data (which may be replicated),
performing concurrency control, and implementing commit or abort. Additional mechanisms are normally
needed for orphan termination, deadlock detection, etc. The issue then arises of how these
mechanisms should themselves be implemented. Our work in ISIS leads us to believe that
transactions are easily implemented on top of fault-tolerant process groups; lacking such a mechanism, a
number of complicated protocols are needed and the associated system support can be substantial.
Moreover, transactions represent a relatively heavy-weight solution to the problems surveyed in
the previous section. We now believe that transactions are inappropriate for casual interactions
between processes in typical distributed systems. The remainder of this paper is therefore focused
on the process group approach.

4.  Existing  broadcast  primitives 

The considerations outlined above motivated us to examine reliable broadcast primitives.
Previous work has been reported on this problem, under assumptions comparable with those of
Sec. 2, and we begin by surveying this research. In [Schneider], an implementation of a reliable
broadcast primitive is described. Such a primitive ensures that a designated message will be
transmitted from one site to all other operational sites in a system; if a failure occurs but any site
has received the message, all will eventually do so. [Chang] and [Cristian] describe
implementations for atomic broadcast, which is a reliable broadcast with the additional property that messages
are delivered in the same order at all overlapping destinations, and this order preserves the
transmission order if messages originate in a single site.

Atomic broadcast is a powerful abstraction, and essentially the same behavior is provided by
one of the primitives we discuss in the next section. However, it has several drawbacks which
made us hesitant to adopt it as the only primitive in the system. Most serious is the latency that is
incurred in order to satisfy the delivery ordering property. Without delving into the
implementations, which are based on a token scheme in [Chang] and an acknowledgement protocol in
[Schneider], we observe that the delaying of certain messages is fundamental to the establishment
of a unique global delivery ordering; indeed, it is easy to prove that this must always be the case.
In [Chang] a primary goal is to minimize the number of messages sent, and the protocol given
performs extremely well in this regard. However, a delay occurs while waiting for tokens to
arrive and the delivery latency that results may be high. [Cristian] assumes that clocks are closely
synchronized and that message transit times are bounded by well-known constants, and uses this to
derive atomic broadcast protocols tolerant of increasingly severe classes of failures. The protocols
explicitly delay delivery to achieve the desired global ordering on broadcasts. Hence for poorly
synchronized clocks (which are typical of existing workstations), latency would be high in
comparison to inter-site message transit times.

Another drawback of the atomic broadcast protocols is that no mechanism is provided for
ensuring that all processes observe the same sequence of failures and recoveries, or for ensuring
that failures and recoveries are ordered relative to ongoing broadcasts. We decided to look more
closely at these issues.

5.  Our  broadcast  primitives 

We now describe three broadcast protocols (GBCAST, BCAST, and OBCAST) for
transmitting a message reliably from a sender process to some set of destination processes. Details of the
protocols and their correctness proofs can be found in [Birman-b]. The protocols ensure “all or
nothing” behavior; if any destination receives a message, then unless it fails, all destinations will
receive it.


5.1.  The  GBCAST  primitive 

GBCAST  (group  broadcast)  is  the  most  constrained,  and  costly,  of  the  three  primitives.  It  is 
used  to  transmit  information  about  failures  and  recoveries  to  members  of  a  process  group.  A 
recovering  member  uses  GBCAST  to  inform  the  operational  ones  that  it  has  become  available. 
Additionally,  when  a  member  fails,  the  system  arranges  for  a  GBCAST  to  be  issued  to  group 
members  on  its  behalf,  informing  them  of  its  failure.  Arguments  to  GBCAST  are  a  message  and  a 
process  group  identifier,  which  is  translated  into  a  set  of  destinations  as  described  below  (Sec. 
5.6). 

Our GBCAST protocol ensures that if any process receives a broadcast B before receiving a
GBCAST G, then all overlapping destinations will receive B before G. This is true regardless of
the type of broadcast involved. Moreover, when a failure occurs, the corresponding GBCAST
message is delivered after any other broadcasts from the failed process. Each member can
therefore maintain a view listing the membership of the process group, updating it when a GBCAST is
received. Although views are not updated simultaneously in real time, all members observe the
same sequence of view changes. Since GBCASTs are ordered relative to all other broadcasts, all
members receiving a given broadcast will have the same value of view when they receive it.¹
Members of a process group can use this value to pick a strategy for processing an incoming
request, or to react to failure or recovery without having to run any special protocol first. Since
the GBCAST ordering is the same everywhere, their actions will all be consistent. Notice that
when all the members of a process group may have failed, GBCAST also provides an inexpensive
way to determine the last site that failed: process group members simply log each new view that
becomes defined on stable storage before using it; a simplified version of the algorithm in
[Skeen-a] can then be executed when recovering from failure.

¹A problem arises if a process p fails without receiving some message after that message has already
been delivered to some other process q: q's view when it received the message would show p to be
operational; hence, q will assume that p received the message, although p is physically incapable of doing so.
However, the state of the system is now equivalent to one in which p did receive the message, but failed before
acting on it. In effect, there exists an interpretation of the actual system state that is consistent with q's
assumption.
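
The view mechanism can be illustrated with a small simulation. This is a minimal sketch, not the ISIS implementation: the `Member` class, the event tuples, and the in-memory `view_log` standing in for stable storage are all hypothetical, and the single shared `events` list simply assumes the ordering guarantee that GBCAST is stated to provide.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """One operational member of a process group (hypothetical model)."""
    name: str
    view: frozenset                                 # current membership view
    view_log: list = field(default_factory=list)    # stands in for stable storage
    seen: list = field(default_factory=list)        # (message, view at delivery)

    def deliver(self, event):
        kind, payload = event
        if kind == "fail":          # GBCAST issued on behalf of a failed member
            self._install(self.view - {payload})
        elif kind == "recover":     # GBCAST issued by a recovering member
            self._install(self.view | {payload})
        else:                       # any other broadcast
            self.seen.append((payload, self.view))

    def _install(self, new_view):
        self.view_log.append(new_view)   # log the new view before using it
        self.view = new_view


# Assumption: every member sees GBCASTs in the same position relative to
# all other broadcasts, so one event sequence describes every member.
events = [("msg", "m1"), ("fail", "r"), ("msg", "m2"),
          ("recover", "r"), ("msg", "m3")]

initial = frozenset({"p", "q", "r"})
p, q = Member("p", initial), Member("q", initial)
for member in (p, q):
    for event in events:
        member.deliver(event)

# Both members associate each message with the same view.
for msg, view in p.seen:
    print(msg, sorted(view))
```

Because the view recorded for each message is identical at p and q, both can choose a processing strategy from the view alone, without running an additional agreement protocol.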


5.2.  The  BCAST  primitive 

The GBCAST primitive is too costly to be used for general communication between process
group members. This motivates the introduction of weaker (less ordered) primitives, which might
be used in situations where a total order on broadcast messages is not necessary. Our second
primitive, BCAST, satisfies such a weaker constraint. Specifically, it is often desired that if two
broadcasts are received in some order at a common destination site, they be received in that order
at all other common destinations, even if this order was not predetermined. For example, if a
process group is being used to maintain a replicated queue and BCAST is used to transmit queue
operations to all copies, the operations will be done in the same order everywhere, hence the
copies of the queue will remain mutually consistent. The primitive BCAST(msg, label, dests),
where msg is the message and label is a string of characters, provides this behavior. Two BCASTs
having the same label are delivered in the same order at all common destinations. On the other
hand, BCASTs with different labels can be delivered in arbitrary order, and since BCAST is not
used to propagate information about failures, no flushing mechanism is needed. The relaxed
synchronization results in lower latency.
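
The replicated queue example can be sketched as follows. The ordering mechanism here is a stand-in: a single hypothetical `LabelSequencer` hands out per-label sequence numbers, which is one simple way to obtain the same-label ordering property and is not the BCAST protocol of [Birman-b]. All class and method names are assumptions made for illustration.

```python
import itertools
from collections import defaultdict


class LabelSequencer:
    """Hypothetical stand-in for BCAST ordering: numbers messages per label."""

    def __init__(self):
        self._counters = defaultdict(itertools.count)   # each label counts from 0

    def stamp(self, label, msg):
        return (label, next(self._counters[label]), msg)


class QueueReplica:
    """One copy of a replicated queue; applies same-label operations in order."""

    def __init__(self):
        self.queue = []
        self._next = defaultdict(int)        # next expected sequence number per label
        self._pending = defaultdict(dict)    # label -> {sequence number: operation}

    def receive(self, stamped):
        label, seq, op = stamped
        self._pending[label][seq] = op
        # Apply every operation that is now contiguous with what was applied.
        while self._next[label] in self._pending[label]:
            self._apply(self._pending[label].pop(self._next[label]))
            self._next[label] += 1

    def _apply(self, op):
        if op[0] == "enq":
            self.queue.append(op[1])
        elif op[0] == "deq" and self.queue:
            self.queue.pop(0)


sequencer = LabelSequencer()
ops = [sequencer.stamp("queue-1", ("enq", "a")),
       sequencer.stamp("queue-1", ("enq", "b")),
       sequencer.stamp("queue-1", ("deq",))]

r1, r2 = QueueReplica(), QueueReplica()
for op in ops:              # the network delivers in sending order to r1
    r1.receive(op)
for op in reversed(ops):    # and in the opposite order to r2
    r2.receive(op)

print(r1.queue, r2.queue)
```

Both replicas end with the same queue contents even though the messages arrived in different orders. Operations carrying a different label would be sequenced independently, mirroring the statement that BCASTs with different labels may be delivered in arbitrary order.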

5.3.  The  OBCAST  primitive 

Our third primitive, OBCAST (ordered broadcast), is weakest in the sense that it involves
less distributed synchronization than GBCAST or BCAST. OBCAST(msg, dests) atomically delivers
msg to each operational dest. If an OBCAST is potentially causally dependent on another, then the
former is delivered after the latter at all overlapping destinations. A broadcast B2 is potentially
causally dependent on a broadcast B1 if both broadcasts originate from the same process, and B2 is
sent after B1, or if there exists a chain of message transmissions and receptions or local events by
which knowledge could have been transferred from the process that issued B1 to the process that
issued B2 [Lamport]. For causally independent broadcasts, the delivery ordering is not constrained.

OBCAST is valuable in systems like ISIS, where concurrency control algorithms are used to
synchronize concurrent computations. In these systems, if two processes communicate
concurrently with the same process the messages are almost always independent ones that can be
processed in any order: otherwise, concurrency control would have caused one to pause until the other
was finished. On the other hand, order is clearly important within a causally linked series of
broadcasts, and it is precisely this sort of order that OBCAST respects.
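
To make the causal delivery rule concrete, here is a minimal sketch of a receiver that holds back a broadcast until everything it potentially depends on has been delivered. It uses vector timestamps, a standard textbook technique that is substituted here for illustration; it is not the piggybacking protocol used for OBCAST, and it assumes every broadcast is addressed to every destination. The class and attribute names are hypothetical.

```python
class CausalReceiver:
    """Delivers broadcasts in an order consistent with potential causality."""

    def __init__(self, n_senders):
        self.delivered = [0] * n_senders  # delivered[i]: broadcasts from sender i delivered here
        self.pending = []                 # received but not yet deliverable
        self.log = []                     # messages in delivery order

    def _deliverable(self, sender, stamp):
        # Must be the next broadcast from its sender, and everything the
        # sender had delivered before sending must already be delivered here.
        return (stamp[sender] == self.delivered[sender] + 1 and
                all(stamp[k] <= self.delivered[k]
                    for k in range(len(stamp)) if k != sender))

    def receive(self, sender, stamp, msg):
        self.pending.append((sender, stamp, msg))
        progress = True
        while progress:  # keep scanning until no held-back message can be released
            progress = False
            for item in list(self.pending):
                s, st, m = item
                if self._deliverable(s, st):
                    self.pending.remove(item)
                    self.delivered[s] += 1
                    self.log.append(m)
                    progress = True

# Sender 0 issues B1; sender 1 delivers B1 and then issues B2, so B2 is
# potentially causally dependent on B1. Here B2 happens to arrive first.
r = CausalReceiver(2)
r.receive(1, [1, 1], "B2")
held_back = list(r.log)
r.receive(0, [1, 0], "B1")
```

B2 is held until B1 arrives, after which both are delivered in causal order. Two causally independent broadcasts would be delivered in whatever order they arrive.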

5.4.  Other  broadcast  primitives 

A weaker broadcast primitive is reliable broadcast, which provides all-or-nothing delivery,
but no ordering properties. The formulation of OBCAST in [Birman-b] actually includes a
mechanism for performing broadcasts of this sort, hence no special primitive is needed for the
purpose. Additionally, there may be situations in which BCAST protocols that also satisfy an
OBCAST ordering property would be valuable. Although our BCAST primitive could be changed
to respect such a rule, when we considered the likely uses of the primitives it seemed that BCAST
was better left completely orthogonal to OBCAST. In situations needing hybrid ordering behavior,
the protocols of [Birman-b] could easily be modified to implement BCAST in terms of OBCAST,
and the resulting protocol would behave as desired.

5.5.  Synchronous  versus  asynchronous  broadcast  abstractions 

Many systems employ RPC internally, as a lowest level primitive for interaction between
processes. It should be evident that all of our broadcast primitives can be used to implement
replicated remote procedure calls [Cooper]: the caller would simply pause until replies have been
received from all the participants (observation of a failure constitutes a reply in this case). We
term such a use of the primitives synchronous, to distinguish it from an asynchronous
broadcast in which no replies, or just one reply, suffice.
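
The synchronous usage can be sketched as follows. This is an illustration only: the sequential loop stands in for a real broadcast, and `invoke`, `MemberFailed`, and `FAILED` are hypothetical names introduced for the example rather than part of any ISIS interface.

```python
class MemberFailed(Exception):
    """Raised by the (hypothetical) transport when a member is observed to fail."""

FAILED = object()  # recorded in place of a reply from a failed member

def replicated_call(members, request, invoke):
    """Synchronous use: return only once every member has replied or failed."""
    replies = {}
    for member in members:
        try:
            replies[member] = invoke(member, request)
        except MemberFailed:
            replies[member] = FAILED  # observing the failure counts as the reply
    return replies

def demo_invoke(member, request):
    # Stand-in for message delivery: member "p2" is treated as having failed.
    if member == "p2":
        raise MemberFailed(member)
    return member + ":" + request

replies = replicated_call(["p1", "p2", "p3"], "read", demo_invoke)
```

The caller resumes with one entry per participant, so a failed member never leaves it waiting.
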

In our work on ISIS, GBCAST and BCAST are normally invoked synchronously, to implement
a remote procedure call by one member of an object on all the members of its process
group. However, OBCAST, which is the most frequently used overall, is almost never invoked
synchronously. Asynchronous OBCASTs are the source of most concurrency in ISIS: although the
delivery ordering is assured, transmission can be delayed to enable a message to be piggybacked
on another, or to schedule I/O within the system as a whole. While the system cannot defer an
asynchronous broadcast indefinitely, the ability to defer it a little, without delaying some
computation by doing so, permits load to be smoothed. Since OBCAST respects the delivery orderings on
which a computation might depend, and is ordered with respect to failures, the concurrency
introduced does not complicate higher level algorithms. Moreover, the protocol itself is extremely
cheap.

A problem is introduced by our decision to allow asynchronous broadcasts: the atomic
reception property must now be extended to address causally related sequences of asynchronous
messages. If a failure were to result in some broadcasts being delivered to all their destinations but
others that precede them not being delivered anywhere, inconsistency might result even if the
destinations do not overlap. We therefore extend the atomicity property as follows. If process r
receives a message m from process s, and s subsequently fails, then unless r fails as well, every
message m' that s received before it sent m must be delivered to its remaining destinations. This is
because the state of r may depend on any such m'. The costs of the protocols are not affected by
this change.

A second problem arises when the user-level implications of this atomicity rule are
considered. In the event of a failure, any suffix of a sequence of asynchronous broadcasts could be
lost and the system state would still be internally consistent. A process that is about to take some
action that may leave an externally visible side-effect will need a way to pause until it is
guaranteed that such broadcasts have actually been delivered. For this purpose, a flush primitive
is provided. Occasional calls to flush do not eliminate the benefit of using OBCAST
asynchronously. Unless the system has built up a considerable backlog of undelivered broadcast messages,
which should be rare, flush will only pause while transmission of the last few broadcasts
completes.
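
The interaction between asynchronous broadcasts and flush can be sketched with a simple backlog. This is a single-process illustration under stated assumptions: `transmit` is a hypothetical callable standing in for actual delivery, and the class and method names are invented for the example.

```python
from collections import deque

class AsyncBroadcaster:
    """Queues broadcasts for later transmission; flush drains the backlog."""

    def __init__(self, transmit):
        self.transmit = transmit  # hypothetical: delivers one message to its destinations
        self.backlog = deque()

    def obcast(self, msg, dests):
        # Asynchronous use: record the broadcast and return immediately.
        self.backlog.append((msg, tuple(dests)))

    def flush(self):
        # Pause until every queued broadcast has been transmitted, in order.
        while self.backlog:
            msg, dests = self.backlog.popleft()
            self.transmit(msg, dests)

    def external_action(self, action):
        # Flush before anything that leaves an externally visible side-effect.
        self.flush()
        return action()

sent = []
b = AsyncBroadcaster(lambda msg, dests: sent.append(msg))
b.obcast("update-1", ["a", "b"])
b.obcast("update-2", ["a", "b"])
sent_before_flush = list(sent)
visible = b.external_action(lambda: list(sent))
```

Nothing is transmitted while the caller keeps computing; the backlog is drained only when an externally visible action is about to be taken.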

5.6.  Group  addressing  protocol 

Since group membership can change dynamically, it may be difficult for a process to
compute a list of destinations to which a message should be sent, for example, as is needed to perform
a GBCAST. In [Birman-b] we report on a protocol for ensuring that a given broadcast will be
delivered to all members of a process group in the same view. This view is either the view that
was operative when the message transmission was initiated, or a view that was defined
subsequently. The algorithm is a simple iterative one that costs nothing unless the group membership
changes, and permits the caching of possibly inaccurate membership information near processes
that might want to communicate with a group. Using the protocol, a flexible message addressing
scheme can readily be supported.
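
The protocol itself is given in [Birman-b] and is not reproduced here. The sketch below only illustrates the general idea of iterating with a cached, possibly stale view: the sender tags a message with the view it believes is current and retries with fresher information if that view has been superseded. All names (`Group`, `try_deliver`, `send_to_group`) and the view-numbering scheme are assumptions made for the example.

```python
class Group:
    """Stand-in for the authoritative state of a process group (hypothetical)."""

    def __init__(self, members):
        self.view_id = 0
        self.members = frozenset(members)
        self.delivered = []  # (view_id, msg, members) for each accepted message

    def change(self, members):
        # A membership change defines a new view.
        self.view_id += 1
        self.members = frozenset(members)

    def try_deliver(self, msg, view_id):
        # Accept only messages addressed to the current view; otherwise
        # hand back the current view so the sender can refresh its cache.
        if view_id != self.view_id:
            return False, (self.view_id, self.members)
        self.delivered.append((view_id, msg, self.members))
        return True, None

def send_to_group(group, cache, msg):
    """Iterate until msg is delivered in a single view; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        ok, newer = group.try_deliver(msg, cache["view_id"])
        if ok:
            return attempts
        cache["view_id"], cache["members"] = newer  # refresh the stale cache

g = Group({"p", "q"})
cache = {"view_id": 0, "members": frozenset({"p", "q"})}
first = send_to_group(g, cache, "m1")   # cache is accurate
g.change({"p", "q", "r"})               # membership changes behind the sender's back
second = send_to_group(g, cache, "m2")  # one extra round to learn the new view
```

When membership is stable the cached view is correct and a single attempt suffices, which matches the observation that the iteration costs nothing unless the group changes.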

5.7.  Example 

Figure 1 illustrates a pair of computations interacting with a process group while its
membership changes dynamically. One client issues a pair of OBCASTs, then uses BCAST to perform a
third request on the group. A second client interacts only once, using BCAST. Note that unless
the first client invoked flush before issuing the BCAST, the BCAST might be received before the
prior OBCASTs at some sites. Arrows showing reply messages have been omitted to simplify the
figure, but it would normally be the case that one or more group members reply to each request.

6.  Using  the  primitives 

The reliable communication primitives described above dramatically simplify the solution of
the problems cited in Sec. 2:

1. Synchronization. Many synchronization problems are subsumed into the primitives
themselves. For example, consider the use of GBCAST to implement recovery. A recovering
process would issue a GBCAST to the process group members, requesting that state
information be transferred to it. In addition to sending the current state of the group to the
recovering process, group members update the process group view at this time. Subsequent
messages to the group will be delivered to the recovered process, with all necessary
synchronization being provided by the ordering properties of GBCAST. In situations where
other forms of synchronization are needed, BCAST provides a simple way to ensure that
several processes take actions in the same order, and this form of low-level synchronization
simplifies a number of higher-level synchronization problems. For example, if BCAST is
used to request write-locks from lock-manager processes, two write-lock requests on the
same item can never deadlock by being granted in different orders by a pair of managers.

[Figure 1: Client processes interacting with a process group. Panels: client computations; process group view.]

2. Fault detection. Consistent failure (and recovery) detection is trivial using our primitives: a
process simply waits for the appropriate process group view to change. This facilitates the
implementation of algorithms in which one process monitors the status of another process.
A process that acts on the basis of a process group view change does so with the assurance
that other group members will (eventually) observe the same event and will take consistent
actions.

3. Consistency. We believe that consistency is generally expressible as a set of atomicity and
ordering constraints on message delivery, particularly causal ones of the sort provided by
OBCAST. Our primitives permit a process to specify the communication properties needed
to achieve a desired form of consistency. Continued research will be needed to understand
precisely how to pick the weakest primitive for a given situation.

4. Serializability. To achieve serializability, one implements a concurrency control algorithm
and then forces computations to respect the serialization order that this algorithm chooses.
The BCAST primitive, as observed above, is a powerful tool for establishing an order
between concurrent events. Having established such an order, OBCAST can be used to
distribute information about the computation and also its termination (commit or abort). Any
process that observes the commit or abort of a computation will only be able to interact with
data managers that have received messages preceding the commit or abort, hence a highly
asynchronous transactional execution results. This problem is discussed in more detail in
[Birman-a] [Joseph-a] [Joseph-b].
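
The write-lock example from item 1 can be illustrated with a small model. It assumes each lock manager grants the lock to the first request delivered to it and queues the rest; this policy and the function names are assumptions made for the illustration, not a description of the ISIS lock managers.

```python
def grant(requests_in_delivery_order):
    """A manager grants the write-lock to the first request it sees; the rest wait."""
    holder, *waiters = requests_in_delivery_order
    return holder, waiters

def managers_agree(delivery_orders):
    """True if every manager grants the lock to the same requester."""
    holders = {grant(order)[0] for order in delivery_orders}
    return len(holders) == 1

# With BCAST, both managers see the two requests in the same order.
same_order = managers_agree([["client_A", "client_B"], ["client_A", "client_B"]])

# Without an ordering guarantee, each client could end up holding the lock at
# one manager while waiting at the other, which is the deadlock BCAST avoids.
mixed_order = managers_agree([["client_A", "client_B"], ["client_B", "client_A"]])
```

Because same-label BCASTs arrive in one order at all common destinations, the second situation cannot arise when the lock requests are sent with BCAST.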

7.  Implementation 

The communication primitives can be built in layers, starting with a bare network providing
unreliable datagrams. A site-to-site acknowledgement protocol converts this into a sequenced,
error-free message abstraction, using timeouts to detect apparent failures. An agreement protocol
is then used to order site failures and recoveries consistently. If timeouts cause a failure to be
detected erroneously, the protocol forces the affected site to undergo recovery.

Built on this is a layer that supports the primitives themselves. OBCAST has a very
lightweight implementation, based on the idea of flooding the system with copies of a message: each
process buffers copies of any messages needed to ensure the consistency of its view of the system.
If message m is delivered to process p, and m is potentially causally dependent on a message m',
then a copy of m' is sent to p as well (duplicates are discarded). A garbage collector deletes
superfluous copies after a message has reached all its destinations. By using extensive
piggybacking and a simple scheduling algorithm to control message transmission, the cost of an
OBCAST is kept low - often, less than one packet per destination. BCAST employs a two-phase
protocol based on one suggested to us by Skeen [Skeen-b]. This protocol has higher latency than
OBCAST because delivery can only occur during the second phase; BCAST is thus inherently
synchronous. In ISIS, however, BCAST is used rarely; we believe that this would be the case in other
systems as well. GBCAST is implemented using a two-phase protocol similar to the one for
BCAST, but with an additional mechanism that flushes messages from a failed process before
delivering the GBCAST announcing the failure. Although GBCAST is slow, it is used even less
often than BCAST. Preliminary performance figures appear in [Birman-b].
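
To show why delivery in a two-phase ordering protocol can only occur in the second phase, here is a sketch of the priority scheme commonly attributed to Skeen: each destination proposes a priority, the sender picks the maximum, and a message is delivered once it is final and at the head of the queue. Whether the ISIS BCAST protocol matches this in detail is not stated above, so treat the code as an assumption-laden illustration; all names are hypothetical, and failures and labels are ignored.

```python
class Site:
    """One destination site in a two-phase totally ordered broadcast."""

    def __init__(self, site_id):
        self.site_id = site_id
        self.clock = 0
        self.queue = {}       # msg_id -> [priority, is_final]
        self.delivered = []

    def propose(self, msg_id):
        # Phase 1: queue the message tentatively and propose a priority.
        self.clock += 1
        priority = (self.clock, self.site_id)  # site id breaks ties
        self.queue[msg_id] = [priority, False]
        return priority

    def commit(self, msg_id, final_priority):
        # Phase 2: record the agreed priority, then deliver from the head
        # of the queue for as long as the head message is final.
        self.clock = max(self.clock, final_priority[0])
        self.queue[msg_id] = [final_priority, True]
        while self.queue:
            head = min(self.queue, key=lambda m: self.queue[m][0])
            if not self.queue[head][1]:
                break  # head is still tentative; its priority can only increase
            self.delivered.append(head)
            del self.queue[head]

s1, s2 = Site(1), Site(2)

# Two concurrent broadcasts reach the sites in opposite orders.
p_m1 = [s1.propose("m1")]
p_m2 = [s2.propose("m2")]
p_m2.append(s1.propose("m2"))
p_m1.append(s2.propose("m1"))

final_m1, final_m2 = max(p_m1), max(p_m2)  # each sender picks the largest proposal

# The commit messages also arrive in different orders at the two sites.
s1.commit("m1", final_m1)
s1.commit("m2", final_m2)
s2.commit("m2", final_m2)
s2.commit("m1", final_m1)
```

Both sites end up delivering the two messages in the same order even though the first-phase and second-phase messages arrived in different orders, at the price of a full round trip before any delivery.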

8.  Applications  of  the  approach 

Our work with communication primitives has convinced us that the resilient objects provided
by the ISIS system exist at too high a level for many sorts of distributed application. For
example, consider the cognac still shown in figure 2. If independent, non-identical computer systems
were used to control distillation, two aspects would have to be addressed. First, it would be
necessary to design the hardware itself in a way that admits safe actions in all possible system
states. Second, however, one would need to implement the control software in each processor in a
way that ensures mutual consistency of the operational computing units. That is, given that the
specification describes a sequence of actions to take in some scenario (for example, detection of
excessive pressure in the distillation vessel), can we be assured that the operational processors will
jointly act to avert a disastrous spill of cognac? We believe that fault-tolerant process groups
provide a simple, elegant way to address problems such as this one. We plan to complete an
implementation of the protocols by the summer of 1986, and then to develop a collection of software
subsystems running on top of them.

[Figure 2: Cognac still. Legible legend entries: 1 - pressure/temp; 2 - heater; liquid source; 6 - bottling unit.]